Abstract: Common statistical prediction models often require 
and assume stationarity in the data. However, in many practical 
applications, changes in the relationship of the response and 
predictor variables are regularly observed over time, resulting in 
the deterioration of the predictive performance of these models. 
This paper presents Linear Four Rates (LFR), a framework for 
detecting these concept drifts and subsequently identifying the 
data points that belong to the new concept (for relearning the 
model). Unlike conventional concept drift detection approaches, 
LFR can be applied to both batch and stream data; is not 
limited by the distribution properties of the response variable 
(e.g., datasets with imbalanced labels); is independent of the 
underlying statistical model; and uses user-specified parameters 
that are intuitively comprehensible. The performance of LFR is 
compared to benchmark approaches using both simulated and 
commonly used public datasets that span the gamut of concept 
drift types. The results show LFR significantly outperforms 
benchmark approaches in terms of recall, accuracy and delay 
in detection of concept drifts across datasets. 


I. Introduction 


A common challenge when mining data streams is that 
the data streams are not always strictly stationary, i.e., the 
concept of data (underlying distribution of incoming data) 
unpredictably drifts over time. This has encouraged the need 
to detect these concept drifts in the data streams in a timely 
manner, be it for business intelligence or as a means to track 
the performance of statistical prediction models that use these 
data streams as input. 


This paper focuses on detecting concept drifts affecting binary classification models. For a binary classification problem, concept drift is said to occur when the joint distribution $P(X_t, y_t)$ changes over time, where $X_t \in \mathbb{R}^d$ are the $d$ predictor variables at time step $t$ and $y_t \in \{0,1\}$ is the corresponding binary response variable. Intuitively, concept drift refers to the scenario in which the underlying distribution that generates the response variable changes over time. Popular approaches for detecting concept drift identify the change point. DDM is the most widely used concept drift detection algorithm and is strictly designed for streaming data. The test statistic DDM employs is the sum of the overall classification error ($\hat{P}_{error}$) and its empirical standard deviation ($\hat{S}_{error}$). DDM focuses on the overall error rate and hence fails to detect a drift unless the sum of false positives and false negatives changes. An example of such a scenario is when a 2 x 2 [...] thus preserving their overall error rate. This limitation is accentuated in imbalanced classification tasks, as seen in the example. Unfortunately, this failure to detect a drastic drop in recall of the minority class is often critical. For instance, if the minority class in the above example corresponded to products at a manufacturing plant that were classified as defective, this critical threefold decrease in 'true positive rate' (i.e., from 0.75 to 0.25) would go unnoticed by DDM.
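The scenario can be checked numerically. The sketch below is illustrative only: it borrows the two confusion probability matrices listed for the Imbalance2 experiment in Section IV and assumes the layout `cp[i][j]` = P(predicted label = i, true label = j); the function names are hypothetical.

```python
def error_rate(cp):
    """Overall error: mass of false positives plus false negatives."""
    return cp[1][0] + cp[0][1]


def true_positive_rate(cp):
    """Minority-class recall TP / (TP + FN)."""
    return cp[1][1] / (cp[1][1] + cp[0][1])


# Assumed layout: cp[i][j] = P(y_hat = i, y = j), with y = 1 the minority class.
cp_before = [[0.65, 0.05], [0.15, 0.15]]
cp_after = [[0.75, 0.15], [0.05, 0.05]]

for name, cp in (("before", cp_before), ("after", cp_after)):
    print(name, round(error_rate(cp), 3), round(true_positive_rate(cp), 3))
```

Under these assumptions the overall error stays at 0.2 in both concepts while the recall falls from 0.75 to 0.25, which is the kind of change a detector that tracks only the overall error cannot see.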

Drift Detection Method for Online Class Imbalance (DDM-OCI) addresses the limitation of DDM when the class ratio is imbalanced. However, DDM-OCI triggers a number of false alarms due to an inherent weakness in the model. DDM-OCI assumes that concept drift in an imbalanced classification task is indicated by a change of the underlying true positive rate (i.e., minority-class recall). This hypothesis unfortunately does not consider the case in which concept drift occurs without affecting the recall of the minority class. It can be shown that the concept can drift from imbalanced class data to balanced class data while the true positive rate (tpr), positive predicted value (ppv) and F1-score remain unchanged. Thus, this type of drift is unlikely to be detected by DDM-OCI unless other rates such as the true negative rate (tnr) or negative predicted value (npv) are also considered. Additionally, the test statistic used by DDM-OCI, $R_{tpr}^{(t)}$, is not approximately distributed as $\mathcal{N}(P_{tpr}, P_{tpr}(1 - P_{tpr})/N_{tpr})$ under the stable concept. Thus, the rationale for constructing the confidence levels specified for DDM-OCI is not suited to the null distribution of $R_{tpr}^{(t)}$. This is the reason DDM-OCI triggers false alarms quickly and frequently.

Early Drift Detection Method (EDDM) achieves better detection results than DDM if the data stream has slow gradual change. EDDM monitors the distance between two classification errors. The PerfSim algorithm considers all the components of a confusion matrix and monitors the cosine similarity coefficient of all components from two batches of data. If the similarity coefficient drops below some user-specified threshold, a concept drift is signified. However, EDDM has to wait for a minimum of 30 classification errors before calculating the monitoring statistic at each decision point. That is, the length of the time interval between decision points is a random number that depends on 30 appearances of classification errors, and a great many examples may lie between 30 classification errors. The PerfSim algorithm is also constrained by the requirement of collecting mini-batch data to calculate monitoring statistics. The method for partitioning the data stream in these approaches is either user-specified from practical experience or has to be learned before the start of detection. Hence, EDDM and PerfSim are not well suited for streaming environments in which decisions are made instantly. Another approach makes use of SVM to monitor three measures: overall accuracy, recall, and precision over time. This approach too computes the three measures by assuming that the data arrives in batches, on which the SVM is learned.


To address the limitations of existing approaches, we present Linear Four Rates (LFR) for detecting the drift of $P(X_t, y_t)$. Unlike other proposed approaches, LFR can detect all possible variants of concept drift, even in the presence of imbalanced class labels, as shown in Section IV. LFR outperforms existing approaches in terms of earliest detection of concept drift, with the fewest false alarms and the best recall. Additionally, LFR does not require the data to arrive in batches and is independent of the underlying classifier employed.

II. Problem formulation

Given that detection of concept drift is equivalent to detecting a change-point in $P(X_t, y_t)$, an intuitive approach is to test the statistical hypothesis upon the multivariate variable $(X_t, y_t)$ in the data stream [6], [7], [8]. The limitation of this approach is that the statistical power degrades when the dimension ($d$) of $X_t$ is extremely large or if the magnitude of the drift is small. Hence, to overcome these limitations, the proposed approach identifies the change in $P(f(X_t), y_t)$, where $f$ is the classifier used for prediction. This is motivated by the fact that any drift of $P(f(X_t), y_t)$ would imply a drift in $P(X_t, y_t)$ with probability 1.

Let $f(X_t) = \hat{y}_t$ be a binary classifier for the given data stream $(X_t, y_t)$. We define the corresponding 2 x 2 confusion probability matrix (CP) for $f$ to be

$$CP = \begin{pmatrix} CP[0,0] & CP[0,1] \\ CP[1,0] & CP[1,1] \end{pmatrix} = \begin{pmatrix} TN & FN \\ FP & TP \end{pmatrix}$$

where CP[1,1], CP[0,0], CP[1,0], CP[0,1] denote the underlying percentage of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) respectively for classifier $f$, i.e., $CP[1,1] = P(y_t = 1, \hat{y}_t = 1)$.

The four characteristic rates (True Positive Rate, True Negative Rate, Positive Predicted Value, Negative Predicted Value) can be computed as follows: $P_{tpr} = TP/(TP + FN)$, $P_{tnr} = TN/(TN + FP)$, $P_{ppv} = TP/(FP + TP)$ and $P_{npv} = TN/(TN + FN)$. All the characteristic rates in $P_* = \{P_{tpr}, P_{tnr}, P_{ppv}, P_{npv}\}$ are equal to 1 if there is no misclassification.

Under a stable concept (i.e., $P(X_t, y_t)$ remains unchanged), $\{P_{tpr}, P_{tnr}, P_{ppv}, P_{npv}\}$ remains the same. Thus, a significant change of any $P_*$ implies a change in the underlying joint distribution of $(y_t, \hat{y}_t)$, or concept. It is worth noting that at every time step $t$, for any possible $(y_t, \hat{y}_t)$ pair, only two of the four empirical rates in $P_*$ will change, and these two rates are referred to as "influenced by $(y_t, \hat{y}_t)$". Also, note that in certain applications a detection is not of interest, and an alarm is unnecessary, if all empirical rates in $P_*$ are increasing, because this suggests that an old model learned from historical data performs even better on the current data stream. We do not use this assumption in this paper, but all methodologies and arguments we propose below can be easily adapted for this assumption.

III. Concept Drift Detection Framework

Given the efficacy of the $P_*$ (where $* \in \{tpr, tnr, ppv, npv\}$) to detect concept drift, the proposed concept drift detection framework uses estimators of the rates in $P_*$ as test statistics to conduct statistical hypothesis testing at each time step. Specifically, the framework at each time step $t$ conducts statistical tests with the following null and alternative hypotheses:

$$H_0: \forall *,\ P(\text{estimator of } P_*^{(t-1)}) = P(\text{estimator of } P_*^{(t)})$$

$$H_A: \exists *,\ P(\text{estimator of } P_*^{(t-1)}) \neq P(\text{estimator of } P_*^{(t)})$$

The concept is stable under $H_0$ and is considered to have drifted if $H_0$ is rejected. The idea is to compare the statistical significance level of the running test statistic under $H_0$ at each time step to the user-defined warning ($\delta_*$) and detection ($\epsilon_*$) significance levels. This type of test is called a "continuing test", and in our problem all time stamps are decision points of acceptance or rejection. Then, when the concept is stable, false alarms on $P_*$ will be triggered unnecessarily once in every $1/\epsilon_*$ time steps in the long run. In this paper, we assume the spacing of decision points is fixed. Accordingly, the familywise error rate and its cost in our continuing test can be controlled by using a simultaneous inference method such as the classical Bonferroni correction on $\epsilon_*$. In a more general case where the spacings of decision points are unequal and the test statistics are strongly positively correlated, we should instead consider the average run length of the test or more powerful alternatives that control the familywise error rate.


A naive implementation of the "continuing test" framework (Naive Four Rates) would be to use $\hat{P}_*^{(t)}$ (the empirical rate of $P_*^{(t)}$) as the estimators and test statistics. But as shown in Section III-C, there are better estimates of $P_*^{(t)}$.

In the following section, the Linear Four Rates (LFR) algorithm will be used to elaborate on the concept drift detection framework. LFR differs from Naive Four Rates (NFR) in terms of the estimator used. However, both LFR and NFR perform better than DDM and DDM-OCI due to the more comprehensive detection framework utilized.


A. Linear Four Rates algorithm (LFR) 

1) Algorithm Outline: LFR uses modified rates $R_*^{(t)}$ as the test statistics for $P_*^{(t)}$. $R_*^{(t)}$ is a modified version of the empirical rate $\hat{P}_*^{(t)}$. At each $t$, $R_*^{(t)}$ is updated as $R_*^{(t)} \leftarrow \eta_* R_*^{(t-1)} + (1 - \eta_*) 1_{\{y_t = \hat{y}_t\}}$ for those empirical rates * "influenced by $(y_t, \hat{y}_t)$". $R_*^{(t)}$ is essentially a linear combination of the classifier's previous performance $R_*^{(t-1)}$ and current performance $1_{\{y_t = \hat{y}_t\}}$, where $\eta_*$ is a time decay factor for weighting the classifier's performance at the current instance. $R_{tpr}^{(t)}$ has been used as a class imbalance detector and as a revised recall test statistic in prior work. The probabilistic characteristics of our test statistic $R_*^{(t)}$ are investigated in §III-A2.

The pseudocode of the framework (using $R_*^{(t)}$ as an estimator of $P_*^{(t)}$ for the required test statistic) is detailed in Algorithm 1.

The three user-defined parameters are the time decaying factor ($\eta_*$), warning significance level ($\delta_*$) and detection significance level ($\epsilon_*$) for each rate. The time decaying factor is











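Algorithm 1 Linear Four Rates method (LFR)

Input: Data $\{(X_t, y_t)\}_{t=1}^{\infty}$ where $X_t \in \mathbb{R}^d$ and $y_t \in \{0,1\}$; binary classifier $f(\cdot)$; time decaying factors $\eta_*$; warn significance level $\delta_*$; detect significance level $\epsilon_*$.
Output: Detected concept drift time ($t_{cd}$).

 1: $R_*^{(0)} \leftarrow 0.5$, $\hat{P}_*^{(0)} \leftarrow 0.5$, where $* \in \{tpr, tnr, ppv, npv\}$, and confusion matrix $C^{(0)} \leftarrow [1,1;1,1]$;
 2: for t = 1 to $\infty$ do
 3:     $\hat{y}_t \leftarrow f(X_t)$
 4:     $C^{(t)}[\hat{y}_t, y_t] \leftarrow C^{(t-1)}[\hat{y}_t, y_t] + 1$
 5:     for each $* \in \{tpr, tnr, ppv, npv\}$ do
 6:         if (* is influenced by $(y_t, \hat{y}_t)$) then
 7:             $R_*^{(t)} \leftarrow \eta_* R_*^{(t-1)} + (1 - \eta_*) 1_{\{y_t = \hat{y}_t\}}$
 8:         else
 9:             $R_*^{(t)} \leftarrow R_*^{(t-1)}$
10:         end if
11:         if ($* \in \{tpr, tnr\}$) then
12:             $N_* \leftarrow C^{(t)}[0, 1_{\{*=tpr\}}] + C^{(t)}[1, 1_{\{*=tpr\}}]$
13:             $\hat{P}_*^{(t)} \leftarrow C^{(t)}[1_{\{*=tpr\}}, 1_{\{*=tpr\}}] / N_*$
14:         else
15:             $N_* \leftarrow C^{(t)}[1_{\{*=ppv\}}, 0] + C^{(t)}[1_{\{*=ppv\}}, 1]$
16:             $\hat{P}_*^{(t)} \leftarrow C^{(t)}[1_{\{*=ppv\}}, 1_{\{*=ppv\}}] / N_*$
17:         end if
18:         warn.bd$_*$ $\leftarrow$ BoundTable($\hat{P}_*^{(t)}$, $\eta_*$, $\delta_*$, $N_*$)
19:         detect.bd$_*$ $\leftarrow$ BoundTable($\hat{P}_*^{(t)}$, $\eta_*$, $\epsilon_*$, $N_*$)
20:     end for
21:     if (any $R_*^{(t)}$ exceeds warn.bd$_*$ & warn.time = 0) then
22:         warn.time $\leftarrow t$
23:     else if (no $R_*^{(t)}$ exceeds warn.bd$_*$) then
24:         warn.time $\leftarrow 0$
25:     end if
26:     if (any $R_*^{(t)}$ exceeds detect.bd$_*$) then
27:         detect.time $\leftarrow t$
28:         relearn $f(\cdot)$ using $\{(X_t, y_t)\}_{t=warn.time}^{detect.time}$
29:         reset $R_*^{(t)}$, $\hat{P}_*^{(t)}$, $C^{(t)}$ as done in Step 1
30:         return $t_{cd} \leftarrow t$
31:     end if
32: end for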


a weight in [0, 1] used to evaluate the performance of classifier $f$ at the current instance prediction $f(X_t)$. Given that the detection methodology conducts hypothesis testing at each time step, $\delta_*$ and $\epsilon_*$ are interpretable statistical significance levels, i.e., type I error (false alarm rate), in the standard testing framework. In practice, the allowable false warning rate and false detection rate in applications such as quality control of a moving assembly line are guidelines that help the user choose the parameters $\delta_*$ and $\epsilon_*$. For a fair comparison, $\eta_*$ is set to 0.9, the same value used in prior work, for all experiments in this paper. The optimal selection of $\eta_*$ is domain dependent and can be pre-learned if necessary.
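The update rule of line 7 can be illustrated with a few lines of Python. This is a minimal sketch rather than the authors' implementation: the function names are hypothetical, and the mapping from a $(y_t, \hat{y}_t)$ pair to its two influenced rates is derived here from the rate definitions (the true label selects tpr or tnr, the predicted label selects ppv or npv).

```python
RATES = ("tpr", "tnr", "ppv", "npv")


def influenced_rates(y, y_hat):
    """Return the two rates whose denominators contain this (y, y_hat) pair."""
    by_label = "tpr" if y == 1 else "tnr"            # conditions on the true label
    by_prediction = "ppv" if y_hat == 1 else "npv"   # conditions on the prediction
    return (by_label, by_prediction)


def lfr_update(r, y, y_hat, eta=0.9):
    """One step of R <- eta * R + (1 - eta) * 1{y == y_hat} on influenced rates."""
    hit = 1.0 if y == y_hat else 0.0
    updated = dict(r)                                # uninfluenced rates carry over
    for name in influenced_rates(y, y_hat):
        updated[name] = eta * r[name] + (1.0 - eta) * hit
    return updated


r0 = {name: 0.5 for name in RATES}                   # initial values as in Step 1
r1 = lfr_update(r0, y=1, y_hat=0)                    # a single false negative
print(r1)
```

With $\eta_* = 0.9$, a single false negative lowers only the tpr and npv statistics and leaves tnr and ppv untouched.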

Theorem 1 in Section III-A2 shows that under the stable concept, $R_*^{(t)}$ is a geometrically weighted sum of i.i.d. Bernoulli random variables, which emphasizes the most recent prediction accuracy and places exponentially decaying weights on the historical prediction accuracies. By taking advantage of this weighting scheme, $R_*^{(t)}$ is more sensitive to concept drifts, foreshadowing the non-stationarity of the classifier's performance.

Building on Theorem 1, we are able to overcome the shortcoming of DDM-OCI noted in Section I and construct a more reliable running confidence interval for $R_*^{(t)}$ to control the type-I error $\epsilon_*$. $R_*^{(t)}$ is distributed as a geometrically weighted sum of Bernoulli random variables. Bhati et al. investigate the closed-form distribution function of $R_*^{(t)}$ for the special case $P_* = 0.5$. However, a closed-form distribution function for other values of $P_*$ is unattainable. Alternatively, according to Theorem 1, a reasonable empirical distribution can be obtained independently by Monte Carlo simulation for given $P_*$, $N_*$ and time decaying factor $\eta_*$. The pseudocode for the Monte Carlo sampling procedure is provided in Algorithm 2. As $P_*$ is unknown, $\hat{P}_*$ is used as its surrogate to generate the empirical distribution of $R_*^{(t)}$. Based on the empirical distribution, the lower and upper quantiles for the given significance level $\alpha$ serve as the required (warning/detect) bounds. The selection of $\hat{P}_*$ as the best surrogate of $P_*$ is supported by Lemma 1.
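The Monte Carlo procedure just described can be sketched in Python as follows. This is a hedged illustration, not the reference implementation: the function name `bound_table_entry`, the default of 2000 replicates, the fixed seed and the simple order-statistic quantile are all assumptions made for the example.

```python
import random


def bound_table_entry(p, eta, alpha, n, num_mc=2000, seed=0):
    """Monte Carlo lower/upper alpha-level bounds for the LFR statistic."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_mc):
        r = 0.0
        for _ in range(n):
            indicator = 1.0 if rng.random() < p else 0.0
            # Recursive form of (1 - eta) * sum_i eta^(n - i) * I_i.
            r = eta * r + (1.0 - eta) * indicator
        samples.append(r)
    samples.sort()
    # Plain order statistics stand in for the quantile function (assumption).
    lower = samples[int(alpha * (num_mc - 1))]
    upper = samples[int((1.0 - alpha) * (num_mc - 1))]
    return lower, upper


lb, ub = bound_table_entry(p=0.8, eta=0.9, alpha=0.01, n=100)
print(lb, ub)
```

A statistic falling below `lb` or above `ub` would then count as exceeding the bound at level `alpha`; a larger `num_mc` gives smoother quantile estimates at a higher precomputation cost.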

$\delta_*$ and $\epsilon_*$ denote the warning and detection significance levels respectively, where $\delta_* > \epsilon_*$. The corresponding warn.bd and detect.bd are obtained from Monte Carlo simulations as described. The bounds of the four rates {tpr, tnr, ppv, npv} of the framework can be set independently, based on importance, by having distinct $\epsilon_*$. For instance, in some imbalanced classification tasks, performance of the classifier on the minority class is a higher priority than on the majority class.

Having computed the bounds, the framework considers that a concept drift is likely to occur and sets the warning signal (warn.time $\leftarrow t$) when any $R_*^{(t)}$ crosses the corresponding warning bounds (warn.bd) for the first time. If any $R_*^{(t)}$ reaches the corresponding detection bound (detect.bd), the concept drift is affirmed at (detect.time $\leftarrow t$).

All examples stored between warn.time and detect.time are extracted to relearn a new classifier, since the stored examples are considered samples of the new concept. In case the number of stored examples is too small to relearn a reasonable classifier, one will have to wait for sufficient training examples. However, if $R_*^{(t)}$ crosses the corresponding warning bounds warn.bd but fails to reach detect.bd, the previous warning flag will be erased. After detecting concept drift, $R_*^{(t)}$, $\hat{P}_*^{(t)}$ and $C^{(t)}$ are reset to their initial values so that a new monitoring cycle can restart.
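The warning and detection logic can be sketched as a small state machine. The example below is a simplification under stated assumptions: the bounds are held fixed over time (in the algorithm they are looked up again at every step), "exceeds" is read as falling outside a two-sided (lower, upper) interval, and the function name `monitor` is hypothetical.

```python
def monitor(r_stream, warn_bd, detect_bd):
    """Scan test statistics and return (warn_time, detect_time).

    r_stream: iterable of dicts mapping rate name -> R value at t = 1, 2, ...
    warn_bd, detect_bd: dicts mapping rate name -> (lower, upper) bounds.
    """
    def outside(r, bounds):
        return any(not (bounds[k][0] <= v <= bounds[k][1]) for k, v in r.items())

    warn_time = None
    for t, r in enumerate(r_stream, start=1):
        if outside(r, warn_bd):
            if warn_time is None:
                warn_time = t            # first crossing of a warning bound
        else:
            warn_time = None             # statistic recovered: erase the warning
        if outside(r, detect_bd):
            return warn_time, t          # drift affirmed at time t
    return warn_time, None               # stream ended without a detection


stream = [{"tpr": v} for v in (0.80, 0.70, 0.80, 0.68, 0.55)]
result = monitor(stream, {"tpr": (0.72, 0.88)}, {"tpr": (0.60, 1.00)})
print(result)
```

In this toy stream the first warning (t = 2) is erased when the statistic recovers at t = 3, so the examples used for relearning would be those between the second warning (t = 4) and the detection (t = 5).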

2) Analysis: The following theorems investigate the statistical properties of the LFR test statistic $R_*^{(t)}$.

Theorem 1: For any *, $R_*^{(T)}$ is a geometrically weighted sum of Bernoulli random variables when there is a stable concept up to time $T$, i.e., $R_*^{(T)} = (1 - \eta_*) \sum_{i=1}^{N_*} \eta_*^{N_* - i} I_i$, where $\{I_i\}_{i=1}^{N_*}$ are i.i.d. Bernoulli($P_*$) and $P_*$ is the underlying rate.

Proof: Among the total $T$ time steps, suppose $R_*^{(t)}$ is changed according to line 7 at time steps $T_1, \ldots, T_{N_*}$, where $T_1 < T_2 < \cdots < T_{N_*} \le T$. Hence,

$$\begin{aligned}
R_*^{(T)} = R_*^{(T_{N_*})} &= \eta_* R_*^{(T_{N_*-1})} + (1 - \eta_*) 1_{\{y_{T_{N_*}} = \hat{y}_{T_{N_*}}\}} \\
&= \eta_* \left[ \eta_* R_*^{(T_{N_*-2})} + (1 - \eta_*) 1_{\{y_{T_{N_*-1}} = \hat{y}_{T_{N_*-1}}\}} \right] + (1 - \eta_*) 1_{\{y_{T_{N_*}} = \hat{y}_{T_{N_*}}\}} \\
&= \eta_*^2 R_*^{(T_{N_*-2})} + \eta_* (1 - \eta_*) 1_{\{y_{T_{N_*-1}} = \hat{y}_{T_{N_*-1}}\}} + (1 - \eta_*) 1_{\{y_{T_{N_*}} = \hat{y}_{T_{N_*}}\}} \\
&= \cdots \\
&= (1 - \eta_*) \sum_{i=1}^{N_*} \eta_*^{N_* - i} 1_{\{y_{T_i} = \hat{y}_{T_i}\}} \\
&= (1 - \eta_*) \sum_{i=1}^{N_*} \eta_*^{N_* - i} I_i
\end{aligned}$$

where the last equation holds by the stable concept assumption, and all indicators are i.i.d. Bernoulli random variables with underlying rate $P_*$. ■

Algorithm 2 Generation of BoundTable in LFR algorithm

Input: Estimate of underlying rate $\hat{P}$; time decaying factor $\eta$; significance level $\alpha$; number of time steps $N_*$; number of Monte Carlo samples num.of.MC.
Output: Numeric bound for significance level $\alpha$.

1: for j = 1 to num.of.MC do
2:     Generate $N_*$ independent Bernoulli random variables $\{I_1, \ldots, I_{N_*}\}$ where $I_i \sim$ Bernoulli($\hat{P}$)
3:     $R[j] \leftarrow (1 - \eta) \sum_{i=1}^{N_*} \eta^{N_* - i} I_i$
4: end for
5: $\{R[j]\}_{j=1}^{num.of.MC}$ forms an empirical distribution $\hat{F}(R)$; find the $\alpha$-level quantile as the lower bound $lb \leftarrow quantile(\hat{F}(R), \alpha)$ and the $(1 - \alpha)$-level quantile as the upper bound $ub \leftarrow quantile(\hat{F}(R), 1 - \alpha)$
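The unrolling used in the proof can be checked numerically. The sketch below is illustrative, with hypothetical function names; it also makes one detail explicit: the expression in Theorem 1 corresponds to a starting value of zero, whereas a nonzero starting value `r0` (such as the 0.5 of Algorithm 1) contributes an extra term `eta ** n * r0` that decays geometrically.

```python
def recursive_statistic(indicators, eta, r0=0.0):
    """Apply R <- eta * R + (1 - eta) * I once per indicator."""
    r = r0
    for ind in indicators:
        r = eta * r + (1.0 - eta) * ind
    return r


def closed_form(indicators, eta, r0=0.0):
    """Geometrically weighted sum, plus the decayed initial value."""
    n = len(indicators)
    weighted = sum(eta ** (n - i) * ind
                   for i, ind in enumerate(indicators, start=1))
    return eta ** n * r0 + (1.0 - eta) * weighted


hits = [1, 0, 1, 1, 0, 1]
for start in (0.0, 0.5):
    print(recursive_statistic(hits, 0.9, start), closed_form(hits, 0.9, start))
```

Both columns agree up to floating-point rounding, for either starting value.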

Lemma 1: Assume the setting in Theorem 1. Under the stable concept, for any $* \in \{tpr, tnr, ppv, npv\}$, $\hat{P}_*^{(T)}$ is the unique Uniformly Minimum Variance Unbiased Estimator (UMVUE) of $P_*$. As $T \to \infty$, $\hat{P}_*^{(T)}$ is approximately distributed as $\mathcal{N}(P_*, P_*(1 - P_*)/N_*)$.

Proof: $\hat{P}_*^{(T)}$ is an unbiased estimator of $P_*$. This is because $\hat{P}_*^{(T)} = \frac{1}{N_*} \sum_{i=1}^{N_*} I_i$, where $\{I_i\}_{i=1}^{N_*}$ are i.i.d. Bernoulli random variables realized at times $T_i$ with parameter $P_*$. By the factorization theorem, $\hat{P}_*^{(T)}$ is a sufficient statistic. Also,

$$\begin{aligned}
E(g(\hat{P}_*^{(T)})) &= \sum_{i=0}^{N_*} \binom{N_*}{i} P_*^i (1 - P_*)^{N_* - i} g\left(\frac{i}{N_*}\right) \\
&= N_*! (1 - P_*)^{N_*} \sum_{i=0}^{N_*} \frac{g(i/N_*)}{i! (N_* - i)!} \left( \frac{P_*}{1 - P_*} \right)^i
\end{aligned}$$

If $E(g(\hat{P}_*^{(T)})) = 0$ $\forall P_*$, it implies $g(i/N_*) = 0$ $\forall i$, because $E(g(\hat{P}_*^{(T)}))$ is a polynomial in $\frac{P_*}{1 - P_*}$. Thereby $P(g(\hat{P}_*^{(T)}) = 0) = 1$ and $\hat{P}_*^{(T)}$ is a complete sufficient statistic by definition. By the Lehmann-Scheffé Theorem, $\hat{P}_*^{(T)}$ is the unique UMVUE.


The complexity of the Linear Four Rates (LFR) detection algorithm is $O(1)$ at each time step. The LFR algorithm can be optimized by using a BoundTable precomputed by Algorithm 2. The 4-dimensional BoundTable with varying input $(\hat{P}, \eta, \delta, N_*)$ can itself be precomputed and stored before running Algorithm 1. It is unnecessary to spend any computational resource on quantile calculation during stream monitoring, because the observer can find the $\hat{P}$ in the BoundTable that is closest to $\hat{P}_*^{(t)}$ and look up the lower and upper quantiles. Thus, the LFR algorithm takes $O(1)$ time to test for drift occurrence at each time point and is well suited to streaming environments.
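One way to realize such a constant-time lookup is to store the table on a uniform grid of rate values. The sketch below is only an assumption about how a BoundTable might be organized: the grid step of 0.01, the dictionary keyed by (grid index, decay factor, significance level, count) and the single placeholder entry are all hypothetical.

```python
def grid_index(p_hat, step=0.01):
    """O(1) index of the grid point k * step closest to p_hat, clamped to [0, 1]."""
    clamped = min(1.0, max(0.0, p_hat))
    return int(round(clamped / step))


def lookup_bounds(table, p_hat, eta, alpha, n, step=0.01):
    """Constant-time lookup keyed by (grid index of P, eta, alpha, n)."""
    return table[(grid_index(p_hat, step), eta, alpha, n)]


# Placeholder entry only; real values would come from the Monte Carlo step.
toy_table = {(83, 0.9, 0.01, 100): (0.55, 0.98)}
bounds = lookup_bounds(toy_table, p_hat=0.834, eta=0.9, alpha=0.01, n=100)
print(bounds)
```

In practice the count $N_*$ would also need to be bucketed or capped, since it grows without bound as the stream progresses.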


B. Naive Four Rates algorithm (NFR) 


For the purpose of comparison, this section details the characteristics of a naive implementation of the proposed framework that uses $\hat{P}_*^{(t)}$ as the test statistic. A benefit of choosing this test statistic is that there exists a closed-form distribution, as shown in Lemma 1. Using the same strategy as the LFR algorithm, the NFR algorithm monitors the four rates $\hat{P}_*^{(t)}$ sequentially. At each time stamp, for each rate, hypothesis testing is done with null distribution $\mathcal{N}(P_*, P_*(1 - P_*)/N_*)$, and the warning / detection alarms are set when $\hat{P}_*^{(t)}$ exceeds the expected bounds.

The main difference with respect to LFR is the estimate of $P_*$ used to find the null distribution. The LFR algorithm uses $\hat{P}_*^{(t)}$ as a surrogate of the unknown $P_*$, while the NFR algorithm uses $\bar{P}_*^{(t)}$, where $\bar{P}_*^{(t)}$ is a running average of all previous $\hat{P}_*^{(i)}$. This update rule allows old prediction performance to contribute more to the estimate of $P_*$ and recent predictions to contribute less. Thus, $\bar{P}_*^{(t)}$ is more robust in terms of estimating the underlying $P_*$ when concept drift occurs. Additionally, $\bar{P}_*^{(t)}$ is still an MSE-consistent estimator under the stable concept, as presented in Lemma 2.
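Because the null distribution in NFR is normal, the bounds follow directly from normal quantiles. The sketch below is a minimal illustration, assuming two-sided bounds at the $\alpha$ and $1 - \alpha$ quantiles (mirroring the lower and upper bounds of Algorithm 2); the function name `nfr_bounds` is hypothetical.

```python
from statistics import NormalDist


def nfr_bounds(p_bar, n, alpha):
    """Normal-approximation bounds for an empirical rate over n trials."""
    sd = (p_bar * (1.0 - p_bar) / n) ** 0.5      # sqrt(P(1 - P) / N)
    z = NormalDist().inv_cdf(1.0 - alpha)        # standard normal quantile
    return p_bar - z * sd, p_bar + z * sd


lower, upper = nfr_bounds(p_bar=0.8, n=400, alpha=0.025)
print(round(lower, 4), round(upper, 4))
```

No simulation or lookup table is needed here, which is the practical benefit of the closed-form null distribution.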

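Lemma 2: Assume the setting in Theorem 1. Under the stable concept up to $T$, for any $* \in \{tpr, tnr, ppv, npv\}$, $\bar{P}_*^{(T)}$ in the NFR algorithm is an MSE-consistent estimator of $P_*$.

Proof: Among the total $T$ time steps, suppose $\bar{P}_*^{(t)}$ is changed at time steps $T_1, \ldots, T_{N_*}$, where $T_1 < T_2 < \cdots < T_{N_*} \le T$. Hence,

$$\bar{P}_*^{(T)} = \bar{P}_*^{(T_{N_*})} = \frac{1}{N_*} \sum_{n=1}^{N_*} \hat{P}_*^{(T_n)} = \frac{1}{N_*} \sum_{n=1}^{N_*} \frac{1}{n} \sum_{i=1}^{n} I_i = \frac{1}{N_*} \sum_{i=1}^{N_*} \left( \sum_{j=i}^{N_*} \frac{1}{j} \right) I_i$$

By the i.i.d. assumption on the indicators, we obtain

$$E(\bar{P}_*^{(T)}) = \frac{1}{N_*} \sum_{i=1}^{N_*} \left( \sum_{j=i}^{N_*} \frac{1}{j} \right) P_* = P_*$$

and

$$\begin{aligned}
VAR(\bar{P}_*^{(T)}) &= \frac{1}{N_*^2} \sum_{i=1}^{N_*} \left( \sum_{j=i}^{N_*} \frac{1}{j} \right)^2 P_*(1 - P_*) \\
&\le \frac{\left( \sum_{j=1}^{N_*} \frac{1}{j} \right)^2}{N_*} P_*(1 - P_*) \\
&\le \frac{(1 + \log N_*)^2}{N_*} P_*(1 - P_*) \to 0
\end{aligned}$$

where the last limit holds by the fact that $N_* \to \infty$ as $T \to \infty$. Thus, as $E(\bar{P}_*^{(T)}) - P_* = 0$ and $VAR(\bar{P}_*^{(T)}) \to 0$, $\bar{P}_*^{(T)}$ is an MSE-consistent estimator. ■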
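The rearrangement of the double sum used in this proof can be verified on a small example. The sketch below is illustrative only, with hypothetical function names: it computes the running average of the empirical rates directly and compares it with the weighted form in which indicator $I_i$ receives weight $\sum_{j=i}^{N_*} 1/j$.

```python
def running_average_of_rates(indicators):
    """Mean of the empirical rates P_hat observed after each update."""
    hits = 0
    rate_sum = 0.0
    for n, ind in enumerate(indicators, start=1):
        hits += ind
        rate_sum += hits / n                 # P_hat after n updates
    return rate_sum / len(indicators)


def weighted_form(indicators):
    """(1/N) * sum_i (sum_{j=i..N} 1/j) * I_i, as in the proof of Lemma 2."""
    n = len(indicators)
    total = 0.0
    for i, ind in enumerate(indicators, start=1):
        weight = sum(1.0 / j for j in range(i, n + 1))
        total += weight * ind
    return total / n


hits = [1, 0, 1, 1]
print(running_average_of_rates(hits), weighted_form(hits))
```

The weights decrease with the index, so earlier indicators count more than recent ones, and the weights sum to $N_*$, which is why the estimator is unbiased under a stable concept.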


C. Comparison between NFR and LFR 

To empirically compare the test statistics of NFR and LFR, we use Figure 1 to illustrate a single run of both the LFR and NFR algorithms on the same synthetic streaming data $\{(y_t, \hat{y}_t)\}_{t=1}^{T}$. The data stream of pairs $\{(y_t, \hat{y}_t)\}_{t=1}^{T}$ with one change-point at $T/2$ is generated by sampling from two confusion probability matrices $CP^{(1)}$ and $CP^{(2)}$. The two concepts are characterized by $CP^{(1)}$ and $CP^{(2)}$ respectively. The type of drift is determined by the particular settings of $(CP^{(1)}, CP^{(2)})$. In this example, to generate a balanced stream of pairs $\{(y_t, \hat{y}_t)\}_{t=1}^{T}$ representing the scenario in which the overall accuracy of the classifier drops but $P_{tpr}$ remains constant, we chose $CP^{(1)} = \begin{pmatrix} 0.4 & 0.1 \\ 0.1 & 0.4 \end{pmatrix}$ and $CP^{(2)} = \begin{pmatrix} 0.3 & 0.1 \\ 0.2 & 0.4 \end{pmatrix}$. The objective of the detection algorithms is to identify the change-point $T/2$.

It is clear that the test statistic $R_*^{(t)}$ in the LFR algorithm has a larger variance than $\hat{P}_*^{(t)}$ for each rate. The LFR algorithm reports an earlier detection at t = 5167 (true drift point t = 5000) when compared to NFR in this run, even though [...]. This observation matches well with the rationale of constructing $R_*^{(t)}$, described in §III-A1, to gain detection sensitivity by introducing large variances. To rigorously compare the detection performance of $R_*^{(t)}$ and $\hat{P}_*^{(t)}$, more investigations are provided below.
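The data generation mechanism used in this comparison can be sketched as follows. This is a hedged illustration rather than the authors' generator: it assumes the layout `cp[i][j]` = P(predicted label = i, true label = j), places the change-point at the midpoint of the stream, and uses a hypothetical function name and a fixed seed.

```python
import random


def sample_stream(cp1, cp2, length, seed=0):
    """Draw (y, y_hat) pairs from cp1 before length // 2 and from cp2 after."""
    rng = random.Random(seed)
    cells = [(0, 0), (0, 1), (1, 0), (1, 1)]         # (y_hat, y)
    stream = []
    for t in range(length):
        cp = cp1 if t < length // 2 else cp2
        weights = [cp[i][j] for i, j in cells]
        y_hat, y = rng.choices(cells, weights=weights, k=1)[0]
        stream.append((y, y_hat))
    return stream


cp_first = [[0.4, 0.1], [0.1, 0.4]]
cp_second = [[0.3, 0.1], [0.2, 0.4]]
pairs = sample_stream(cp_first, cp_second, length=10000)
print(len(pairs), pairs[:5])
```

Because only the pairs $(y_t, \hat{y}_t)$ are sampled, no classifier or feature vector is involved, which keeps the comparison of detectors independent of the classifier.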

Power characteristics of the two competing test statistics, $R_*^{(t)}$ (LFR) and $\hat{P}_*^{(t)}$ (NFR), are compared empirically on synthetic data. We denote by $\beta_{R^{(t)}}$ and $\beta_{\hat{P}^{(t)}}$ the power estimates of $R_*^{(t)}$ and $\hat{P}_*^{(t)}$ respectively. The $\beta_{R^{(t)}}$ and $\beta_{\hat{P}^{(t)}}$ against varying time lag $k$ and $q_*$ are presented in Figure 2. Figure 2 indicates that neither $R_*^{(t)}$ nor $\hat{P}_*^{(t)}$ dominates all the time, because $R_*^{(t)}$ (red surface) achieves a larger statistical power when the time lag $K$ is small but a smaller power when $K$ is large. This is because the update rule (line 7) enables the estimator $R_*^{(t)}$ to shift from $p_*$ to $q_*$ at an exponential rate, which leads to the power dominance at short lags. The price is that the limiting distributions of $R_*^{(t)}$ under both the null and the alternative have larger variances than those of $\hat{P}_*^{(t)}$, and thus the limiting power is degraded when $K$ is large and $|p_* - q_*|$ is small.

In order to compare the sensitivities of $R_*^{(t)}$ and $\hat{P}_*^{(t)}$ with regard to detecting concept drift in more general settings, we used $\Delta\beta = \beta_{R^{(t)}} - \beta_{\hat{P}^{(t)}}$. The result is illustrated in Figure



Fig. 1. A single run of LFR and NFR on the same synthetic streaming data. Black, red and green vertical lines are the 'true drift time', LFR detection and NFR detection time respectively. Four colored dots (black, red, green, blue) are running $R_*^{(t)}$ and four colored horizontal lines (indigo, pink, yellow, grey) are running $\hat{P}_*^{(t)}$, where $* \in \{tpr, tnr, ppv, npv\}$.

Fig. 2. Power comparison between $R_*^{(t)}$ and $\hat{P}_*^{(t)}$, where the null distribution is at $t = M$ and the alternative distribution is at $t = M + k$. $M = 1000$, $1 \le k \le K$ where $K = 200$ is the maximal time lag. The underlying rate is drifted from $p_* = 0.9$ to $q_*$ where $0.1 \le q_* \le 0.8$.

Fig. 3. Power difference $\Delta\beta = \beta_{R^{(t)}} - \beta_{\hat{P}^{(t)}}$ along the time lag $K$ in different combinations of concept change from $p_*$ (first concept) to $q_*$ (second concept).














































3. Except when $p_* = q_*$, we see that for any fixed pair $(p_*, q_*)$, $\Delta\beta > 0$ when $K$ is small and $\Delta\beta < 0$ when $K$ is large. This is because $\Delta\beta$ decreases as the time lag $K$ increases. This suggests that LFR is preferable if earlier detection is highly desired. The alarms are more likely to be triggered at the earliest time after the occurrence of concept drift. Earlier detection allows the observer to adjust the model and avoid the costs of incorrect predictions immediately. On the other hand, if observers are only concerned with detecting the occurrence of drift in the data stream but unconcerned with the promptness of detection, then the NFR algorithm provides a higher-power test statistic to detect the drift. This is because $\hat{P}_*^{(t)} \to q_*$ with convergence rate $O(1/\sqrt{K})$. In the long run, as $K \to \infty$, $\hat{P}_*^{(t)} \to q_*$ implies that $\beta \to 1$.

To guide the selection between LFR and NFR, Figure 4 is a heatmap of limiting power estimates on all $(p_*, q_*)$ pairs using $K = 200$. We can see that the power is already close to 1 for $K = 200$ when $p_*$ and $q_*$ are significantly different.


IV. Experiments

In this section, we compare the detection performance of LFR to the NFR, DDM and DDM-OCI approaches using both synthetic data and public datasets. We considered 3 simulated class-balanced datasets, 3 simulated class-imbalanced datasets and 4 public datasets to demonstrate that the LFR algorithm performs well across various types of concept drift, including those where the baselines perform poorly.

To generalize the performance and evaluate the confidence of the algorithms, we use the bootstrapping technique. For each synthetic dataset, we generate 100 data streams of $\{(y_t, \hat{y}_t)\}_{t=1}^{T}$ rather than $\{(X_t, y_t)\}_{t=1}^{T}$ so that the comparison of detection algorithms is independent of the classifiers employed. For each public dataset, the order of the $(X_t, y_t)$ pairs within each concept is permuted to create 100 bootstrapped data streams. Each stream is fed to all detection algorithms to obtain single-run detections for each method. To illustrate the accuracy of the prediction, we use overlapped histograms to visualize the distribution of detection points obtained from the concept drift detection models across the 100 runs. To avoid redundancy, we present 6 histograms out of the 10 experiments; the remaining ones are similar. As shown below, LFR consistently outperformed the baseline approaches. When compared to NFR, LFR correctly identifies more true drift points with higher probability and a smaller number of false alarms, even with a smaller $\epsilon_*$.

A. Synthetic Data

Numerous experiments were run on synthetic data, covering various types of concept drift. In each bootstrap, a data stream of pairs $\{(y_t, \hat{y}_t)\}_{t=1}^{T}$ with one change-point at $T/2$ is generated by using the same mechanism introduced in §III-C. The objective of the detection algorithms is to identify the change-point $T/2$. Six challenging and interesting scenarios are discussed below.


1) Balanced Dataset: In balanced datasets, $P(y_t = 0) = P(y_t = 1)$ is required in the underlying data generation. Class-balanced data are the most typical scenario in classification tasks and are hence investigated with the following three representative experiments.

(i) Balance1: Overall accuracy of the classifier drops but $P_{tpr}$ remains constant, with $CP^{(1)} = \begin{pmatrix} 0.4 & 0.1 \\ 0.1 & 0.4 \end{pmatrix}$ and $CP^{(2)} = \begin{pmatrix} 0.3 & 0.1 \\ 0.2 & 0.4 \end{pmatrix}$.

(ii) Balance2: Gradual drift in which the overall accuracy ($1 - P_{error}$) remains the same, with $CP^{(1)} = \begin{pmatrix} 0.35 & 0.05 \\ \cdots & \cdots \end{pmatrix}$ and $CP^{(2)} = \begin{pmatrix} 0.4 & 0.1 \\ \cdots & \cdots \end{pmatrix}$.

(iii) Balance3: Overall accuracy ($1 - P_{error}$) increases and $P_{tpr}$ remains unchanged, with $CP^{(1)} = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}$ and $CP^{(2)} = \begin{pmatrix} 0.4 & 0.2 \\ 0.1 & 0.3 \end{pmatrix}$.

2) Imbalanced Dataset: For imbalanced datasets, we used the same data generation mechanism as in the balanced case but made $P(y_t = 0)$ and $P(y_t = 1)$ imbalanced. We considered the following three types of concept drift, which receive much attention in real applications.

(i) Imbalance1: From a class-balanced dataset to a class-imbalanced dataset, with $CP^{(1)} = \begin{pmatrix} 1/3 & 1/6 \\ 1/6 & 1/3 \end{pmatrix}$ and $CP^{(2)} = \begin{pmatrix} 13/15 & 1/30 \\ 1/30 & 1/15 \end{pmatrix}$. Without loss of generality, let $y = 1$ be the minority class. It is also noteworthy that $P_{tpr}$ and $P_{ppv}$ are unchanged after the drift occurrence. Hence, many detectors in the imbalanced data learning community that use the F1 score as a measure to monitor classifier performance are unable to raise an alarm for this type of drift. However, Fig. 7 shows that LFR performs very well, combining a high early detection rate with few false alarms. Besides, DDM and DDM-OCI have no detection after the change-point due to [...] respectively.
Fig. 5. Overlapping histograms comparing detection timestamps on Balance1 dataset in which overall accuracy of classifier drops but $P_{tpr}$ remains unchanged. Number of counts of LFR is above the top bar of each bin.

Fig. 7. Overlapping histograms comparing detection timestamps on Imbalance1 dataset in which class ratio transits from 1:1 to 9:1 but F1-score remains unchanged. Number of counts of LFR is above the top bar of each bin.




Dataset    T       True Drift Time                    dimensions (d)
SEA        60000   $\{15000 \times i\}_{i=1}^{3}$     3
HYPER.     90000   $\{10000 \times i\}_{i=1}^{8}$     10
USENET1    1500    $\{300 \times i\}_{i=1}^{4}$       100
USENET2    1500    $\{300 \times i\}_{i=1}^{4}$       100

TABLE 1. Key features of datasets.


chose the Support Vector Machine (SVM) [13] with an RBF kernel as the classifier $f$, because all detection algorithms are independent of the type of classifier. Misclassification of the minority class is penalized 100 times more than that of the majority class. If a potential concept drift is reported by the algorithm, examples from the new concept are stored to retrain a new SVM classifier $f_{new}$ adapted to the new concept. Specifically, 1000 examples are used for retraining on the SEA and Rotating Hyperplane datasets; 100 examples are used for retraining on the USENET1 and USENET2 datasets.


Fig. 6. Overlapping histograms comparing detection timestamps on Balance2 
dataset in which gradual drift occurs but overall accuracy remains the same. 
Number of counts of LFR is above the top bar of each bin. 


(ii) Imbalance2: The class ratio and $P_{error}$ remain unchanged but $P_{tpr}$ decreases, with $CP^{(1)} = \begin{pmatrix} 0.65 & 0.05 \\ 0.15 & 0.15 \end{pmatrix}$ and $CP^{(2)} = \begin{pmatrix} 0.75 & 0.15 \\ 0.05 & 0.05 \end{pmatrix}$.

(iii) Imbalance3: All Pt pr ,Pppv and 1 — P error decreases. 
Though class ratio remains the same, both Fl-score and 
overall accuracy decreases. Two conditional probability 

0.6 0.15 ' 


matrices are selected as CP W = 

CP {2) = 


0.15 0.1 


and 


0.6 

0.15 


0.15 

0.1 
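
The stated properties of Imbalance2 can be checked numerically. The sketch below assumes that each entry CP[i][j] is the joint probability of predicting class i when the true class is j, with class 1 as the positive (minority) class. This layout is an inference: it is the reading under which the Imbalance2 matrices keep the class ratio fixed while $P_{tpr}$ drops, and it is not spelled out in this section. The function name `four_rates` is hypothetical.

```python
import math
from typing import Dict, List


def four_rates(cp: List[List[float]]) -> Dict[str, float]:
    """Rates implied by a 2x2 matrix with cp[pred][true] joint probabilities.

    Assumed layout: row = predicted label, column = true label,
    label 1 = positive class.
    """
    tn, fn = cp[0][0], cp[0][1]
    fp, tp = cp[1][0], cp[1][1]
    return {
        "class_ratio": (tn + fp) / (fn + tp),  # negatives per positive
        "error": fn + fp,
        "tpr": tp / (tp + fn),
        "tnr": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Imbalance2 matrices as given in the text.
CP1 = [[0.65, 0.05], [0.15, 0.15]]
CP2 = [[0.75, 0.15], [0.05, 0.05]]

before = four_rates(CP1)
after = four_rates(CP2)

if __name__ == "__main__":
    for name in before:
        print(f"{name}: {before[name]:.4f} -> {after[name]:.4f}")
    assert math.isclose(before["error"], after["error"])
```

Under this reading, both matrices give a 4:1 class ratio and an error of 0.2, while $P_{tpr}$ falls from 0.75 to 0.25, so a detector that monitors only the error rate sees no change.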


B. Public Datasets 

All detection algorithms are evaluated on four public
datasets used in the literature. Without loss of generality, we


1) Datasets: The SEA Concepts dataset is used in [14]. The
dataset is available at http://www.liaad.up.pt/kdus/products/datasets-for-concept-drift
and is widely used as a testbed by
concept drift detection algorithms. The Rotating Hyperplane dataset
is created by El .The dataset and specific (k, t) pairs of each 
concept are available at http://www.win.tue.nl/~mpechen/data/DriftSets/.
The USENET1 and USENET2 datasets, used in [16], are
available at http://mlkd.csd.auth.gr/concept_drift.html. They
are stream collections of messages from different newsgroups
(e.g., medicine, space, baseball) to a user. The difference
between USENET1 and USENET2 is the magnitude of the drift:
the user in USENET1 has a sharper topic shift than the one
in USENET2.

All of the above datasets are of the form $\{X_t, y_t\}_{t=1}^{T}$, and their key
features are summarized in Table I. Other details, such as the
imbalance status and type of drift of each dataset, are available
through the links above.

2) Evaluation: In the SEA Concepts dataset experiment, Fig. 8
shows that LFR dominates the other three approaches, with more
early detections and fewer false or delayed detections.
Fig. 8. Overlapping histograms comparing detection timestamps on SEA.






Fig. 10. Overlapping histograms comparing detection timestamps on 
USENET1. 


Dataset      LFR   NFR   DDM   DDM-OCI
Balance1       6    77    36       304
Balance2      13    19    33       339
Balance3      18    54    11       219
Imbalance1    18    81    16       259
Imbalance2    10    91    23       165
Imbalance3     9    86    55       204
SEA           72    32    54       658
HYPRPLN       84    56    73       826
USENET1       12    50    43       322
USENET2       43    80    65       272

TABLE III. THE COUNT (SUM) AT (MULTIPLE) FALSE DETECTION FOR
THE SIMULATED (PUBLIC) DATASETS





Fig. 9 shows that LFR has a dominant performance in the
Rotating Hyperplane dataset experiment. At the second true
drift time point, the underlying concept change is very minor,
so the drift is missed by all detection algorithms.
Fig. 9. Overlapping histograms comparing detection timestamps on HYPERPLANE.


In the USENET1 dataset experiment, Fig. 10 indicates that LFR
dominates the other approaches and that all drift points are detected.
Similarly, in the USENET2 dataset experiment, LFR also outperforms
the other approaches, but its detections are delayed by a
longer time lag. The reduced advantage of LFR from
USENET1 to USENET2 is due to the smaller magnitude
of the concept drifts.


Dataset      LFR   NFR   DDM   DDM-OCI
Balance1      38    12     4        12
Balance2      16     3     0        11
Balance3      25     4     0         3
Imbalance1    95    59     0         4
Imbalance2    91    21     0        43
Imbalance3    95    38    36        39
SEA          142    29    17        26
HYPRPLN      671   598   345       149
USENET1      207    47   108        66
USENET2        3    17     3        21

TABLE II. THE COUNT (SUM) AT (MULTIPLE) TRUE DRIFT POINT
CORRECTLY DETECTED FOR SIMULATED (PUBLIC) DATASETS.


C. Summary Statistics

In general, the best algorithm has the fewest false alarms
and the most early detections, whereas poor algorithms
produce many false alarms and missed or severely
delayed true detections. A summary of
the counts of correct detections at the true drift timestamps and the
counts of false detections during the false detection period for the


Parameters   Detect Sig.                  Warn Sig.                 Decay
LFR          $\epsilon_\star = 1/100K$    $\delta_\star = 1/100$    $\eta_\star = 0.9$
NFR          $\epsilon_\star = 1/1K$      $\delta_\star = 0.025$    $\eta_\star = 0.9$
DDM          $\alpha_{detect} = 3$        $\alpha_{warn} = 2$
DDM-OCI      $\alpha_{detect} = 20$       $\alpha_{warn} = 10$      $\eta_\star = 0.9$

TABLE IV.

Para.     Symbol               SEA      HYPRPLN.   USENET1&2
LFR       $\epsilon_\star$     1/10K    1/10K      1/10K
          $\delta_\star$       1/100    1/100      1/100
NFR       $\epsilon_\star$     1/1K     1/1K       1/1K
          $\delta_\star$       0.025    0.025      0.025
DDM       $\alpha_{detect}$    3        3          3
          $\alpha_{warn}$      2        2          2
DDM-OCI   $\alpha_{detect}$    20       30         3
          $\alpha_{warn}$      10       10         2

TABLE V. PARAMETER SETTINGS USED IN §IV-B EXPERIMENTS


simulated and public datasets are provided in Tables II and III.


The false detection period refers to the period preceding the
data points that belong to the new concept. For the synthetically
generated datasets in §IV-A, there were two concepts
spanning the T data points, so the false detection period
is defined as [0, T/2). For the datasets specified in §IV-B, where
there were more than two concepts, the false detection period
corresponds to the range from the midpoint of each concept up to the
next true drift point. Each bin in the histograms corresponds to
200 time steps in §IV-A and is dataset-dependent in §IV-B. Since
it has been observed in 03, ED that false alarms may have 
a smaller influence on predictive performance than late drift
detections, the true detection period in our experiments refers
to the period spanning the next 200 time steps (1 bin) after T/2 in
§IV-A and the period spanning 1 bin after each true drift point
in §IV-B. Other parameter settings of the detection algorithms are
summarized in Tables IV and V. They were specifically selected
to show the dominating performance of LFR, i.e., the smallest
allowable type-I error but the largest statistical power, over
the benchmark algorithms.
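
The two evaluation windows can be made concrete with a short sketch. It is an illustration under stated assumptions rather than the evaluation code used for the experiments: the function name, the label "other" for detections that fall in neither window, the stream length of 10000 for the simulated case, and the bin width of 500 for SEA are all hypothetical choices.

```python
from typing import Sequence


def classify_detection(t: int, drift_points: Sequence[int], total: int,
                       bin_width: int, from_midpoint: bool) -> str:
    """Label a detection time as 'true', 'false' or 'other'.

    'true'  : within one bin after a true drift point.
    'false' : inside the false detection period of a concept that is
              followed by a drift (the whole concept, or only its second
              half when from_midpoint is set).
    'other' : anything else, e.g. a delayed detection.
    """
    for drift in drift_points:
        if drift <= t < drift + bin_width:
            return "true"
    bounds = [0] + list(drift_points) + [total]
    for start, end in zip(bounds[:-1], bounds[1:]):
        if start <= t < end:
            if end == total:  # last concept: no upcoming drift to anticipate
                return "other"
            false_start = (start + end) / 2 if from_midpoint else start
            return "false" if t >= false_start else "other"
    return "other"


if __name__ == "__main__":
    # Simulated setting: two concepts, drift at T/2, 200-step bins.
    print(classify_detection(4999, [5000], 10000, 200, from_midpoint=False))
    print(classify_detection(5100, [5000], 10000, 200, from_midpoint=False))
    # Public setting (SEA-like drift points), hypothetical 500-step bins.
    sea = [15000, 30000, 45000]
    print(classify_detection(14000, sea, 60000, 500, from_midpoint=True))
```

Checking the true detection window first means that a detection landing just after a drift point is never counted as a false alarm for the following concept.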


As summarized in Table II, LFR fared best in terms of
recall of true change-point detection across the various datasets.
Equally important, LFR had the highest precision with
regard to detecting change points, producing the fewest
false and delayed detections (Table III).
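
For readers who want to inspect the per-dataset picture behind this summary, the sketch below transcribes the counts of Tables II and III as they appear in this text and reports, for a given dataset, which algorithm has the most correct detections or the fewest false detections. The helper name `leaders` is hypothetical, and the comparison by raw counts is an illustrative simplification, not a metric defined in the paper.

```python
from typing import Callable, Dict, List, Tuple

ALGORITHMS = ("LFR", "NFR", "DDM", "DDM-OCI")

# Table II: correct detections at the true drift points.
TRUE_DETECTIONS: Dict[str, Tuple[int, int, int, int]] = {
    "Balance1": (38, 12, 4, 12), "Balance2": (16, 3, 0, 11),
    "Balance3": (25, 4, 0, 3), "Imbalance1": (95, 59, 0, 4),
    "Imbalance2": (91, 21, 0, 43), "Imbalance3": (95, 38, 36, 39),
    "SEA": (142, 29, 17, 26), "HYPRPLN": (671, 598, 345, 149),
    "USENET1": (207, 47, 108, 66), "USENET2": (3, 17, 3, 21),
}

# Table III: detections inside the false detection period.
FALSE_DETECTIONS: Dict[str, Tuple[int, int, int, int]] = {
    "Balance1": (6, 77, 36, 304), "Balance2": (13, 19, 33, 339),
    "Balance3": (18, 54, 11, 219), "Imbalance1": (18, 81, 16, 259),
    "Imbalance2": (10, 91, 23, 165), "Imbalance3": (9, 86, 55, 204),
    "SEA": (72, 32, 54, 658), "HYPRPLN": (84, 56, 73, 826),
    "USENET1": (12, 50, 43, 322), "USENET2": (43, 80, 65, 272),
}


def leaders(table: Dict[str, Tuple[int, int, int, int]], dataset: str,
            pick: Callable[[Tuple[int, ...]], int]) -> List[str]:
    """Algorithms attaining pick(counts) on a dataset (pick is max or min)."""
    counts = table[dataset]
    target = pick(counts)
    return [name for name, c in zip(ALGORITHMS, counts) if c == target]


if __name__ == "__main__":
    for dataset in TRUE_DETECTIONS:
        most_true = leaders(TRUE_DETECTIONS, dataset, max)
        fewest_false = leaders(FALSE_DETECTIONS, dataset, min)
        print(f"{dataset}: most correct={most_true}, fewest false={fewest_false}")
```

Ties are returned as a list so that no algorithm is arbitrarily preferred when two counts coincide.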


V. Conclusion 

The paper presents a concept drift detection framework
(LFR) for detecting the occurrence of a concept drift and
identifying the data points that belong to the new concept. The
versatility of LFR allows it to work with both batch and
stream datasets and with imbalanced datasets, and it uses user-specified
parameters that are intuitively comprehensible, unlike other
popular concept drift detection approaches. LFR significantly
outperforms existing benchmark approaches in terms of early
detection of concept drifts, high detection rate and low false
alarm rate across the types of concept drifts.